Part I - PISA 2012 Exploration

by Jamie Potter

Introduction

The data are from the 2012 round of the Programme for International Student Assessment (PISA), a triennial international education survey organised by the Organisation for Economic Co-operation and Development (OECD). The aim is to provide meaningful international comparisons for the educational attainment of 15-16 year olds for participant nations.

Alongside a general questionnaire that quizzes pupils on various aspects of school life and life at home, pupils take a two-hour computer-based test. Scores are then scaled so that each of the three areas tested (mathematics, reading, science) has mean 500 and standard deviation 100. Typically, the object of most media interest is the ranking that appears at the end, where countries are ranked in terms of their mean scores. The 2012 survey also included a section on problem-solving.

The dataset was downloaded on 06/11/2021 from Udacity's server here.

Preliminary Wrangling

Using pisadict2012.csv, I'm going through the dataset to get an idea of how the various variables are encoded and to think about what would be interesting to investigate. The technical report ('PISA-2012-technical-report-final.pdf') is also helpful here, as it gives more information on exactly how the data are created and what they mean.

It seems to me that a good selection of variables to look at would be:

We need to look at response variables now.

Ah, so what's going on here is that the researchers have made an adjustment for the fact that they're only sampling each student for a very limited amount of time, and so instead of giving each one a single score, they're sampling 5 times from the most plausible distribution. This will not give individual students an accurate reflection of their ability, but it will make the group level data more accurate.

So we have PV1MATH to PV5MATH for mathematics, PV1SCIE to PV5SCIE for science, and PV1READ to PV5READ for reading. I can't seem to find the scores for the problem-solving thing. Apparently they're supposed to be under PV1CPRO to PV5CPRO, but they're not here. What I'll eventually do is work out the mean of the five plausible scores for each student for the three scores we do have, and I'll use that as the response variables from now on.
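A minimal sketch of that averaging step, assuming the plausible values sit in columns named as above. The two-student frame `raw` and its values are invented purely for illustration; only the column naming pattern comes from the dataset.

```python
import pandas as pd

# Map the subject suffix used in the PISA column names to the score name used later.
subjects = {"MATH": "math_score", "SCIE": "sci_score", "READ": "read_score"}

# Invented plausible values for two students (not real PISA data).
raw = pd.DataFrame({
    f"PV{i}{subj}": [400.0 + 10 * i, 550.0 - 5 * i]
    for subj in subjects
    for i in range(1, 6)
})

# For each subject, take the row-wise mean of its five plausible values.
scores = pd.DataFrame(index=raw.index)
for subj, name in subjects.items():
    pv_cols = [f"PV{i}{subj}" for i in range(1, 6)]
    scores[name] = raw[pv_cols].astype(float).mean(axis=1)

print(scores)
```

Averaging the plausible values per student is a convenience for exploration only; it is not the procedure the technical report recommends for population-level estimates.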

Trying now to understand the rather extensive set of codes for 'BRR-Fay' Weights.

These Fay's Balanced Repeated Replication (BRR) Weights are a way of estimating the variability of statistics obtained by stratified sampling. You need to use this if you're going to generate any sort of inferences at all about the populations involved in the survey. This is taking me well beyond my competence level, however, so I'll just stick to talking about the dataset directly!

What I'm going to do now is start to build a restricted dataframe with only a selection of variables.

It occurs to me that it might be useful to have an overall score for each student, if only for exploratory purposes. An unweighted average of sci, math and reading is more or less appropriate.
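As a rough sketch, the overall score is just the row-wise mean of the three subject scores. The frame below uses made-up numbers, and the column names follow the ones used later in this report.

```python
import pandas as pd

# Made-up subject scores for two students.
scores = pd.DataFrame({
    "math_score": [430.0, 535.0],
    "sci_score": [450.0, 520.0],
    "read_score": [470.0, 505.0],
})

# Unweighted average of the three subjects, one value per student.
subject_cols = ["math_score", "sci_score", "read_score"]
scores["overall_score"] = scores[subject_cols].mean(axis=1)

print(scores)
```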

Now I need to start cleaning the data so that all my columns are the right data types. I'll just go through these in order.
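The kind of type conversion involved might look like the sketch below. The column names `age` and `books`, and the full list of book categories, are assumptions for illustration (only some of those labels are quoted in this report), so the real encodings should be checked against pisadict2012.csv.

```python
import pandas as pd

# Everything arrives as strings; these rows are invented.
df = pd.DataFrame({
    "age": ["15.58", "15.92", "16.17"],
    "books": ["26-100", "More than 500", "0-10"],
})

# Assumed ordering of the books categories, lowest to highest.
book_levels = ["0-10", "11-25", "26-100", "101-200", "201-500", "More than 500"]

# Numeric strings become floats; unparseable entries become NaN.
df["age"] = pd.to_numeric(df["age"], errors="coerce")

# Survey answers with a natural order become ordered categoricals.
df["books"] = pd.Categorical(df["books"], categories=book_levels, ordered=True)

print(df.dtypes)
```

Using an ordered categorical means that sorting and plotting follow the survey order rather than alphabetical order.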

The data are now ready for some exploratory analysis. To finish off this section on Data Wrangling, let's summarise what we know so far.

What is the structure of your dataset?

Each row is an observation, i.e. the values for a single student. Originally, all the columns were strings: first demographic information, then answers to survey questions, then the key values of interest, the plausible values for maths, science, and reading. Finally, there are 80 columns or so used to assign weights for that student's scores so that population-level inferences can be made.

However, I've restricted the data to only 16 variables of interest, assigning them the correct data types in the process. Each row is still a single student, so the basic structure is still the same.

What is/are the main feature(s) of interest in your dataset?

The key response variables in the original dataset were the plausible values for maths, science and reading. Of interest would be any sort of factor that displays some sort of relationship with those scores. As I stated in the introduction, international comparisons are typically the main point of interest for the media, but I've chosen to avoid making such comparisons because I am not competent in terms of weighting each score correctly. Instead, I'll be looking at patterns inside the dataset.

Moreover, I'm working with four response variables here - math_score, sci_score, read_score, and overall_score. The first three are averages of the plausible values generated from the theorised distribution for each student based on their result, and then the overall is a simple average of those three. I would imagine, however, that I'll only be using the last variable for exploratory purposes as it's a somewhat arbitrary metric (why should reading, science and mathematics be weighted equally?).

What features in the dataset do you think will help support your investigation into your feature(s) of interest?

There are many survey questions that I anticipate would be interesting to pursue: whether they have a desk at home, how many televisions they have at home, use of one-player games or collaborative games at home, whether either of their parents has an undergraduate (or higher) education (ISCED Levels 5A and 6), wealth, age (do older children perform better?), psychological factors (locus of control, perseverance, openness). It's possibly also worth exploring the interaction between gender and the relative scores, since a historical pattern is that students who perform worse at reading but better at maths/science tend to be male, but there may be other wrinkles with gender.

What I've done is created a smaller dataframe with these columns only so as to make things more workable as I proceed.

Univariate Exploration

I'll start by examining our response variables.

As noted in the introduction, the scales are normed so that the mean is 500 and the standard deviation is 100, which you can see reflected in their descriptive statistics below.

Thus it is not altogether surprising that we're seeing the shape we're seeing - normally-distributed, symmetrical distributions centred on 500 with standard deviation 100. I'll now go through the explanatory variables in order, starting with age. Although this is a float, it's going to end up looking more like a categorical variable, I suspect, simply because it only records age to the nearest month.

As suspected, the ages show this discretised form. It's not a big problem, however. Another slight curio here in that the age range isn't evenly distributed across 12 months, presumably because of slightly different year group arrangements from country to country, or something of that nature. Now onto the wealth of the participants. How is it distributed?

Wealth is an index PISA generates from asking about household possessions (TVs, mobile phones, cars, having one's own room, culturally-specific items). It is roughly normally-distributed, centred around -0.34, with a standard deviation of 1.22.

Onto the sex of the students in the survey. What can we say about this?

This is a contender for the least interesting graph in history. Moving swiftly on to highest parental education.

The modal group is 'ISCED 5a, 6'. Just to explain these categories, the following is taken from the PISA 2012 Technical Report:

As you can see, the modal class is the last one. I found this a little surprising as I didn't think the proportion of people with that level of education would be that high, even taking into account that it's choosing the highest level from both the parents.

Onto number of whole days truant.

Here one can see that the vast majority of students surveyed have never had a whole day of unexcused absence from school.

For the possessions variables, I'll clump them together to speed things along a bit...

Here one can see that it's very common for students to have their own desk, and still pretty common for them to have their own room. It's also very common for pupils to have three or more mobile phones per household. It seems likely to me that these categories won't turn out to be very useful in the analysis because they don't differentiate between participants very well, but we shall see. I was expecting the students in the dataset to be a little less WEIRD (Western, Educated, Industrialised, Rich, Democratic), but realistically we're looking at educational comparisons between developed countries/regions within countries. This is also manifest from the generally high level of parental education we saw previously.

For TVs, we're seeing a little more differentiation, with many in the 'one' or 'two' televisions per household bins, but it would have been nice if they'd included more categories - generally catch-all options in surveys such as 'three or more' shouldn't be modal. It's a similar story for computers. What's probably happened here is that they're sticking to a design that was appropriate when PISA was first introduced in 2000 but hasn't aged gracefully. They may be useful predictors nonetheless.

Finally, for the reported number of books per household, we see that 26-100 is the modal class. It's possible that this is a saliency effect as it feels 'about right' if you're trying to estimate something. Nonetheless, there's a good spread of different responses, and it will be interesting to investigate if there are any effects/interactions with our response variables.

That just leaves our two psychological indices - perseverance and openness to problem-solving.

As we saw for age, there are only a limited number of possible values, hence the discretised form of the histogram. In both cases we have slight upticks at the extreme right tails of our distributions (3.5286 for perseverance and 2.4465 for openness to problem-solving), indicating we may not have unimodal distributions here. It is tempting to treat these extreme values as anomalous, but there isn't sufficient reason to do so here.

Discuss the distribution(s) of your variable(s) of interest. Were there any unusual points? Did you need to perform any transformations?

The response variables I'm interested in - the mathematics, reading, and science scores of the students - are all normed to have mean 500 and standard deviation 100, and they were normally distributed as such. No transformations were necessary.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

The most unusual distribution was probably the last one I looked at - the indices used for perseverance and openness to problem-solving. This was interesting because of the sudden spikes at 3.5286 for perseverance and 2.4465 for openness to problem-solving, both right at the extreme right tail of the distribution. It's possible that something like this has occurred due to students filling in their questionnaires in a way that suggested they were giving the answers they felt they should be giving (social desirability bias), but there's insufficient reason here for me to conclude that and thus exclude these values from the data. I will thus leave them as they are.

Bivariate Exploration

A good place to start here is to examine a plot matrix and examine the relationships between the quantitative variables. What relationships can we see between the quantitative variables?

There appears to be a weak-to-moderate positive correlation between perseverance and openness to problem-solving (r=0.468). Surprisingly, however, neither of the two psychological indices appears to have much of a relationship at all with any of the student score measures (r < 0.2) - if there is an effect, it is going to be quite muted. Wealth shows a weak, positive correlation with mathematics (r=0.279), reading (r=0.248), and scientific literacy (r=0.27).

In terms of the relationships between our three response variables, there is a very strong, positive correlation between each pair of variables, though it looks as though the correlation between mathematics score and reading score (r=0.881) is slightly weaker than the correlation between mathematics and science (r=0.928) or the correlation between reading and science (r=0.904). This strong inter-correlation aligns with the traditional view that there is a single, general factor, 'g', behind intelligence (or at least, the sort of intelligence that psychometricians typically measure), and that not much is gained by attempting to introduce a more complicated model.
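Pairwise Pearson correlations like those quoted above can be read off a correlation matrix. The sketch below uses synthetic scores built from a shared component plus noise, so its numbers are illustrative only and will not match the PISA figures.

```python
import numpy as np
import pandas as pd

# Synthetic scores: a shared component plus subject-specific noise (not PISA data).
rng = np.random.default_rng(0)
n = 500
shared = rng.normal(0, 90, n)
scores = pd.DataFrame({
    "math_score": 500 + shared + rng.normal(0, 30, n),
    "read_score": 500 + shared + rng.normal(0, 30, n),
    "sci_score": 500 + shared + rng.normal(0, 30, n),
})

# DataFrame.corr computes Pearson's r for every pair of columns by default.
corr = scores.corr()
print(corr.round(3))
```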

Now, since I dropped age from the plot matrix, I'll now investigate it a little on its own. The question I'm interested in here is: does age have an impact on overall score?

Looking at the scatterplot above showing the relationship between age and overall score, there doesn't appear to be any relationship here. At least, if there is an effect, it's so small that it's not possible to discern from a visual inspection of the data.

Onto the effect of sex. I stated earlier that I was anticipating a sex-linked difference in the profile of scores, with the {high maths, low reading} types tending to skew male. This ought to be detectable if we compare the distributions of scores for both sexes across the three response variables of interest via, say, a violin plot.

Do the sexes differ in their distributions of scores?

Also plotting a box-plot here as it may be more useful if I end up using this for explanatory purposes:

The main point of interest here is the relatively sizeable difference in reading scores, with males scoring below females. There's also some right skew to the male distribution, indicating that there's a preponderance of lower-performing males in terms of reading when really we're expecting it to be a normally-distributed thing.

A similar thing is true, but to a lesser extent when it comes to females and mathematics. The distribution is again right-skewed when really we'd expect it to be normally-distributed.

For science, however, things are more or less even. There's perhaps evidence of greater variation amongst the males, but if there is, it's quite a small difference.

In terms of visualisation, I think I may prefer the violin plots here over the box plots because you can see the skew a bit better.

Onto highest parental education, then. What is the relationship between highest parental education level and the student scores?

Here we see evidence of a relationship between parental education level and the mean score of the student. As the education level increases, the mean score achieved by the student increases, and this effect is visible on all three response variables. The standard deviations (indicated by the black lines) are roughly the same for all categories.
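The group means and standard deviations behind a plot of this kind can be tabulated with a groupby. This is only a sketch: the column name `parent_edu` is a hypothetical stand-in for the parental education variable, and the scores are invented.

```python
import pandas as pd

# Invented scores for two parental education categories.
students = pd.DataFrame({
    "parent_edu": ["ISCED 2", "ISCED 2", "ISCED 5A, 6", "ISCED 5A, 6"],
    "math_score": [400.0, 420.0, 500.0, 540.0],
})

# Mean and sample standard deviation of the score within each category.
summary = students.groupby("parent_edu")["math_score"].agg(["mean", "std"])
print(summary)
```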

We can compose a very similar plot for truancy level. What is the relationship between the number of unauthorised day absences and scores?

The effect is a fair bit less dramatic than you'd think it would be, but you do see the relationship you'd expect - pupils with a higher truancy rate also tend to have a lower reading, mathematics, and science score on the PISA test. The standard deviations appear to be more or less the same for all three response variables.

Moving on now to the possessions. Here I'm just going to do plots for the overall score and combine them into one plot.

Is there a relationship between household possessions and student score?

Unsurprisingly, there appears to be a positive relationship between a student's having a desk to work on, and their grades. Similarly for them having their own room, although it looks to be somewhat more muted, which is the opposite way round to what I'd have expected.

In general, it looks as though all of the possessions indicators have the same basic pattern - more stuff is a proxy for greater household wealth, and household wealth is associated with higher student scores.

Of slightly more interest perhaps is the books factor, since we see an apparent slight decline in the mean student score as we move from the '201-500 books' category to the 'More than 500' category. It's worth investigating further. Perhaps there is a kind of 'bookworm effect', where a certain number of books is helpful for bolstering reading, but beyond that number proves detrimental to mathematics and science study?

There doesn't appear to be anything particularly different about the effect on reading here. The distributions are more or less identical, with that slight drop in mean/median as we move from the penultimate to the last category. It's also true that the student scores in the last category are more variable, and left-skewed too.

But perhaps there is a relationship between the number of books and the number of computers? Do households of bibliophiles tend to eschew computers, or does it follow the same pattern of both being indicators of underlying wealth?

The heatmap above shows the percentages of households within each book category that have 0, 1, 2, or 3+ computers in their household. As you can see, as the number of books per household increases, the relative frequency of the computers increases too. This again seems to correspond to our basic model that all the possessions indices are really just functioning as proxies for wealth. There doesn't appear to be anything special about books.
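Row percentages of this kind can be produced with a normalised cross-tabulation. In this sketch the column names `books` and `computers`, the category labels, and the eight households are all invented for illustration.

```python
import pandas as pd

# Eight invented households.
homes = pd.DataFrame({
    "books": ["0-10", "0-10", "0-10", "0-10",
              "26-100", "26-100", "26-100", "26-100"],
    "computers": ["None", "None", "One", "Two",
                  "One", "Two", "Two", "Three or more"],
})

# normalize="index" divides each row by its own total, so every book
# category sums to 100% across the computer categories.
pct = pd.crosstab(homes["books"], homes["computers"], normalize="index") * 100
print(pct.round(1))
```

A table like `pct` is the sort of input a heatmap would take, with each row then readable as a distribution over computer counts.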

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

Much of the Bivariate Exploration phase hasn't revealed anything particularly exciting. We've seen that the 3 response variables (mathematics, reading and science scores) are tightly inter-correlated, that the psychological indices of perseverance and openness to problem-solving aren't especially important in terms of predicting scores, that truancy is associated with lower scores, but that in general any sort of indicator of wealth is associated with increasing scores, albeit to different degrees. I also detected a slight difference in the distribution of scores between males and females, with females tending to score higher on reading, and males tending to score higher on mathematics. The effect is not very big, however.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

I was hoping there might be something interesting to be seen with the number of books - e.g. a distinctive relationship with reading in particular, or some trade-off between books and computers, but there simply wasn't anything that exciting in the data, unfortunately. In general, it looks as though possessions are only interesting as indicators of wealth, and thus I'll discontinue studying possessions in their own right in the multivariate exploration phase.

Multivariate Exploration

We've already seen how there is a subtle difference in the male and female distributions when it comes to reading and mathematics scores. A natural question to ask here is if there is an interaction between sex and other factors when it comes to scores. Males seem to have a skewed distribution on reading, with a preponderance of lower scores, which suggests that there may be a number of males who are reading a fair bit below their potential. One question we could ask, then, is whether there's an interaction with wealth. Is the combination of (poor, male) substantially worse in terms of reading score than the combination of (poor, female)?

This visualisation here isn't very good because the overlapping colours are confusing and lack any sort of natural interpretation, but what you can observe is that there is indeed an interaction effect here since the gradients of the regression lines are quite clearly different. Specifically, the reading scores for females increase faster as wealth increases than they do for males.

The visualisation here isn't terribly compelling either. I think I'm probably best off with a smaller sample without any transparency, a smaller marker size, and a little bit of x_jitter.

Now I just want to see if a similar thing happens for mathematics and science, and also do some calculations to back up my eyeball judgements:

Aha! An interesting effect at last. There's a definite interaction when it comes to reading scores between wealth and sex (with a regression line for females that's 22% steeper than it is for males). However, the interaction is less marked for wealth and sex when it comes to mathematics scores and science scores (with a regression line for females that's 13% steeper than it is for males in both cases). For some reason, increasing wealth has more of an impact on females than it does males, and this effect is most pronounced when it comes to reading scores.
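A "percent steeper" figure of this sort can be obtained by fitting a straight line per sex and comparing the gradients. The sketch below is one way to do that, not necessarily the calculation used here: the data are noise-free and constructed so the female slope is exactly 25% steeper, and the column names `sex` and `wealth` are assumptions.

```python
import numpy as np
import pandas as pd

# Constructed, noise-free data: slope 25 for females, 20 for males.
wealth = np.linspace(-3.0, 3.0, 25)
students = pd.concat([
    pd.DataFrame({"sex": "Female", "wealth": wealth,
                  "read_score": 480 + 25 * wealth}),
    pd.DataFrame({"sex": "Male", "wealth": wealth,
                  "read_score": 460 + 20 * wealth}),
], ignore_index=True)


def slope(group: pd.DataFrame) -> float:
    """Gradient of the least-squares line of read_score on wealth."""
    gradient, _intercept = np.polyfit(
        group["wealth"].to_numpy(), group["read_score"].to_numpy(), deg=1
    )
    return float(gradient)


slopes = {sex: slope(group) for sex, group in students.groupby("sex")}

# How much steeper the female line is than the male line, in percent.
pct_steeper = (slopes["Female"] / slopes["Male"] - 1) * 100
print(slopes, round(pct_steeper, 1))
```

Comparing fitted slopes describes the sample only; judging whether the difference would hold for the populations surveyed would need the survey weights discussed earlier.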

Another thing that might have an interesting interaction with wealth is highest parental education. Is it the case that pupils with parents who are wealthy and well-educated perform better over and above the expected effects of wealth and education alone? And which has the stronger relationship with test scores - increasing the highest parental education level, or increasing the household wealth?

The same pattern can be observed on all three response variables - generally, mean scores increase as wealth increases, except for the '1.51 to 2.5' category, where it appears to reverse. Mean scores generally also increase as one moves up the levels of highest parental education, although the effect isn't as pronounced as with wealth.

There also appears to be some interaction here. When it comes to higher wealth levels (-1.49 to -0.5, -0.49 to 0.5, and 0.51 to 1.5), it is clear that the higher the parental education level, the better. In all three of these wealth levels, the highest mean is recorded in the 'ISCED 5A, 6' category (advanced tertiary education). However, for wealth levels below -1.5, the highest mean scores are recorded in the 'ISCED 3B, C' or 'ISCED 2' categories for highest parental education.
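One way to build the wealth-by-education table of means is to bin the wealth index and pivot. This is a sketch under assumptions: the bin edges approximate the bands named above (a value of exactly -1.5 lands in the lowest band here), the column names are hypothetical, and the eight students are invented.

```python
import numpy as np
import pandas as pd

# Eight invented students.
df = pd.DataFrame({
    "wealth": [-2.0, -1.8, -1.0, -0.9, 0.0, 0.2, 1.0, 1.2],
    "parent_edu": ["ISCED 2", "ISCED 5A, 6"] * 4,
    "read_score": [440.0, 420.0, 450.0, 470.0, 470.0, 500.0, 480.0, 530.0],
})

# Assumed bin edges and labels for the wealth bands.
edges = [-np.inf, -1.5, -0.5, 0.5, 1.5, 2.5]
labels = ["below -1.5", "-1.49 to -0.5", "-0.49 to 0.5",
          "0.51 to 1.5", "1.51 to 2.5"]
df["wealth_band"] = pd.cut(df["wealth"], bins=edges, labels=labels)

# Mean reading score for every (wealth band, parental education) cell;
# observed=True drops bands with no students in this toy sample.
table = df.pivot_table(index="wealth_band", columns="parent_edu",
                       values="read_score", aggfunc="mean", observed=True)

# Education category with the highest mean within each wealth band.
best = table.idxmax(axis=1)
print(table)
print(best)
```

Counting the students in each cell alongside the means would help show whether an apparent reversal rests on very few observations.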

Interpreting this result is tricky. It's possible (probable?) that this is merely an artefact of the relatively low numbers of people who tend to fall in such groups, since generally higher parental education will mean higher household wealth. It may also be that there is a demotivating effect along the lines of "Study really hard at school, and you too can earn a pittance just like me!", a sentiment that many millennials will no doubt recognise. Bromides about the joys of learning for learning's sake may not be the best for motivating teenagers.

Talking about motivation, I thought it might be good to return to our psychological indices, perseverance and openness to problem-solving. We saw earlier in the plot matrix that any correlations with our response variables aren't terribly dramatic, but I was interested in whether the combination of the two might show something more interesting.

Seeing as all three of the variables I'm interested in here (persev, prob_solv_open, math_score/read_score/sci_score) are quantitative, one option is to use a scattergraph with hue as the encoding for the response variable. I'll try this first.

What is the relationship between our psychological variables considered together and test scores?

Here we see the moderate, positive correlation between perseverance and openness to problem-solving we observed earlier, but only scant indication, if any, that either is particularly associated with higher test scores. Even if you include the effect of high self-reported values in both, there doesn't appear to be any particular effect on test scores. I won't labour the point by trying to produce other representations of this uninteresting plot.

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

The relationship between the psychological indices and test scores turned out not to be terribly interesting, unfortunately. If there is a relationship between self-reported perseverance and openness to problem-solving and test scores, it is very small.

However, I did observe effects when it came to wealth and highest parental education. Mean test scores increased as wealth increased, and as one moved up the levels of highest parental education. There was also some interaction in that, at higher wealth levels, those with the highest parental education fared the best on average in tests, but at lower wealth levels, it was those with parents at a lower secondary or vocational/pre-vocational upper secondary level that scored the highest on average.

Were there any interesting or surprising interactions between features?

The interaction when it came to reading scores between wealth and sex was an interesting feature for me. I ended up with a regression line for females that's 22% steeper than it is for males, whereas it was only 13% steeper when it comes to the mathematics and science scores. As wealth increases, this has more of an impact on the test scores of females than it does males, but this effect is most pronounced when it comes to reading scores.

Conclusions

Wealth

Sex

Wealth and Sex